Welcome to our advanced R project on a marketing campaign data set for term deposit offers! You can also find this project in the following GitHub repository. The project addresses a classification problem faced by a Portuguese bank, which can be found under the following link. The bank wanted to know which customers are more likely to accept a term deposit offer during a telemarketing campaign, so a predictive model is created to answer this question and to optimize the bank’s marketing resources. The underlying idea is that it is preferable to call customers who are more likely to accept the offer, because each call incurs marketing cost. With such a predictive model, not only can cost be controlled better, but the actual return on the marketing campaign can also be improved.
Our work is divided into two parts: data exploration and modelling (including feature engineering). Two markdowns are therefore provided to give a better overview.
But before we go into the project itself, let us introduce the team that tackled this problem by dividing the machine learning process into three different work streams:
O1 R Masters :)
Before we start with the first part of our project, which is data exploration, we load the required R libraries from a separate R script called “libraries”.
source('notebooks/libraries.R')
We were provided with two data sets. The first is a train data set, which includes the actual outcome of the marketing activity for each contacted customer (labeled train data set). The second is a “blind” test data set that does not include the actual outcome of the campaign. At the end of our project we will use our best model to make predictions on this test data set. One can think of this as putting the model into production and feeding it with new data in order to predict which customers will accept the offer and which will not. The overall data set contains 45,211 rows, and the original data set can be found in the UCI Machine Learning Repository. It is important to note that the data sets we were provided with are not identical to the repository version: some of the features from the repository are not included here, while other features such as balance, education, and day are added or transformed. However, we know from the information provided with the data set that it covers activities from May 2008 to November 2010. Comparing the sizes of the train and the blind test set, we see that the overall data set was split into 80% train and 20% test.
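The 80/20 split mentioned above can be checked with a quick back-of-the-envelope calculation, using the row counts that `str()` reports for the two data sets:

```r
# Back-of-the-envelope check of the train/test split,
# using the row counts reported by str() on both data sets
n_train <- 36168
n_test  <- 9043
n_total <- n_train + n_test                  # 45211 rows, matching the UCI data set
split_train <- round(n_train / n_total, 2)   # 0.8
split_test  <- round(n_test  / n_total, 2)   # 0.2
```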
The provided data sets contain general as well as banking-specific information about each contacted client. Such data can usually be found in the CRM system of a bank. The bank intends to use this data to identify patterns that are related to the likelihood of a customer responding positively to the marketing campaign. It is important to note that this campaign, conducted via phone calls, is about a term deposit offer. A term deposit is a specific banking product: a fixed-term investment that involves depositing money into an account at a financial institution. Term deposits usually carry short-term maturities ranging from one month to a few years and have varying levels of required minimum deposits. Thus the developed predictive model should only be used for marketing campaigns around term deposits, since the same customer can respond completely differently when offered a different banking product such as overdraft protection or a loan.
Now that we know what the data set is about in general let us take a closer look at it and see what it actually contains in terms of features.
Both data sets include the same 16 features while the train set contains an additional column storing the outcome of the marketing campaign, which represents our target variable. Since the target variable is either yes (term deposit offer accepted) or no (term deposit offer not accepted), we are facing a binary classification problem.
The next step is to take a closer look at the provided features.
raw_train_data<-fread('Data/BankCamp_train.csv', stringsAsFactors = F)
raw_test_data<-fread('Data/BankCamp_test.csv', stringsAsFactors = F)
str(raw_train_data)
## Classes 'data.table' and 'data.frame': 36168 obs. of 17 variables:
## $ age : int 50 47 56 36 41 32 26 60 39 55 ...
## $ job : chr "entrepreneur" "technician" "housemaid" "blue-collar" ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education: chr "primary" "secondary" "primary" "primary" ...
## $ default : chr "yes" "no" "no" "no" ...
## $ balance : int 537 -938 605 4608 362 0 782 193 2140 873 ...
## $ housing : chr "yes" "yes" "no" "yes" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "unknown" "unknown" "cellular" "cellular" ...
## $ day : int 20 28 19 14 12 4 29 12 16 3 ...
## $ month : chr "jun" "may" "aug" "may" ...
## $ duration : int 11 176 207 284 217 233 297 89 539 131 ...
## $ campaign : int 15 2 6 7 3 3 1 2 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 276 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 2 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
## - attr(*, ".internal.selfref")=<externalptr>
str(raw_test_data)
## Classes 'data.table' and 'data.frame': 9043 obs. of 16 variables:
## $ age : int 58 43 51 56 32 54 58 54 32 38 ...
## $ job : chr "management" "technician" "retired" "management" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "primary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 593 229 779 23 529 -364 1291 0 424 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 55 353 164 160 1492 355 266 179 104 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## - attr(*, ".internal.selfref")=<externalptr>
For the following data exploration we create two vectors: one storing all categorical features and one storing the numeric/continuous features.
discrete_var<-c("job", "marital", "education", "default", "housing", "loan","contact", "month", "poutcome")
continuous_var<-c("age", "balance","day", "duration", "campaign", "pdays", "previous")
First of all we take a look at how our target is distributed, i.e. how many of the contacted customers accepted the offer compared to how many did not. We see that most of the customers did not accept the offer (88.4%); only 11.6% of the contacted customers actually accepted. Assuming the customers were contacted randomly without any pre-selection, roughly one out of nine customers accepts the offer.
This insight is very important for our later machine learning model: if we predicted ‘no’ for every customer, we would end up with an accuracy of almost 90%. Likewise, if ‘no’ is defined as the positive class and we predict every row as ‘no’, we get very high sensitivity (true positives / [true positives + false negatives]). However, the idea behind this model is to identify customers who are more likely to accept our offer, so we not only need to define ‘yes’ as the positive class but also use a sampling method when training our model, due to the high class imbalance.
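To make the imbalance argument concrete, here is a minimal sketch of what an all-‘no’ classifier would score, using the 88.4% / 11.6% class shares reported for this data set:

```r
# Naive baseline: predict "no" for every customer.
# Class shares are the 88.4% / 11.6% reported in the target distribution plot.
p_no  <- 0.884
p_yes <- 1 - p_no
accuracy_all_no    <- p_no   # 0.884 -- looks impressive but is useless
sensitivity_all_no <- 1      # with "no" as positive class: TP / (TP + FN) = 1
recall_of_yes      <- 0      # not a single accepting customer is identified
```

This is why accuracy alone is a misleading metric here and why ‘yes’ must be treated as the positive class.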
target_dist<-ggplot(raw_train_data, aes(y, fill=y))+
geom_bar() + scale_fill_manual(values=c("#995052", "#529950")) +
xlab("Campaign offer accepted")+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
ggtitle("Distribution of target variable") + theme_minimal() +
theme(text = element_text(face = "bold"), legend.position = "none",
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
target_dist
The next step is to look at the distribution of our features. This helps us identify whether some variables are skewed or unbalanced. We start with the continuous variables and look at the quartiles (including min and max), the mean and the standard deviation. Especially the minimum helps us identify special cases such as pdays = -1, which would be harder to spot when only looking at histograms.
In the case of pdays we see that more than 75% of the rows contain a -1, which means the customer was never contacted before. This is also reflected in the feature previous, which counts the contacts performed before this campaign and shows a 0 for most customers.
cont_var <- as.data.frame(raw_train_data)
cont_var <- cont_var[ , (names(cont_var) %in% continuous_var)]
distribution <- as.data.frame(t(sapply(cont_var, quantile)))
distribution$Mean <- sapply(cont_var, mean)
distribution$SD <- sapply(cont_var, sd)
datatable(round(distribution, 2))
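As a small illustration of the -1 convention in pdays (using a toy vector rather than the actual data), the share of never-contacted customers can be computed directly:

```r
# -1 in pdays encodes "customer was never contacted before";
# the share of such rows is simply the mean of a logical vector
pdays <- c(-1, -1, -1, 276, -1, 92, -1, -1)   # toy example, not the real data
share_never_contacted <- mean(pdays == -1)    # 0.75 in this toy example
```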
Looking at the density plots of our continuous variables, we can see that almost all of them are highly skewed to the right. The exception is day, which reflects the day of the month. However, we see variation across days rather than a uniform distribution, meaning there are days of the month on which more customers are called than on others.
cont_var_melt <- as.data.frame(melt(cont_var))
cont_dist <- ggplot(cont_var_melt, aes(value)) +
geom_density(aes(fill = variable)) +
facet_wrap(~variable, scales = "free", nrow = 3) +
labs(x = "", y = "", fill = "") + theme_minimal() +
scale_fill_tableau() + ggtitle("Distribution of each continous variable") +
theme(text = element_text(face = "bold"), legend.position = "none",
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
cont_dist
An important aspect for the stability of a model is that the feature distributions of the train and the test set are similar. Thus we created, for each feature, a chart that overlays its distribution in the train and in the test set. Overall, we see that the distributions match between train and test for all variables. This is an important aspect when deploying models in practice and making predictions on new data.
## Create a new column in the test dataset
raw_test_data$y <- NA
## Creating a column "dataType" for both train and test datasets and assign the value 'train' & 'test'
raw_train_data$dataType <- "train"
raw_test_data$dataType <- "test"
## Merging both train and test datasets
dataset <- rbind(raw_train_data, raw_test_data)
ggplot(dataset, aes(x=age, color = dataType)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Age Distribution") + theme_classic() +
scale_color_manual(values=c("#e08926", "#3526e0"))
ggplot(dataset, aes(x=balance, color = dataType)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Balance Distribution") + theme_classic() +
scale_color_manual(values=c("#e08926", "#3526e0"))
Day within the month
ggplot(dataset, aes(x=day, color = dataType)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Day Distribution") + theme_classic() +
scale_color_manual(values=c("#e08926", "#3526e0"))
Call length
ggplot(dataset, aes(x=duration, color = dataType)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Duration Distribution") + theme_classic() +
scale_color_manual(values=c("#e08926", "#3526e0"))
Number of contacts performed during this campaign
ggplot(dataset, aes(x=campaign, color = dataType)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Campaign Distribution") + theme_classic() +
scale_color_manual(values=c("#e08926", "#3526e0"))
Number of days that passed by after the client was last contacted from a previous campaign
ggplot(dataset, aes(x=pdays, color = dataType)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("pdays Distribution") + theme_classic() +
scale_color_manual(values=c("#e08926", "#3526e0"))
Number of contacts performed before this campaign
ggplot(dataset, aes(x=previous, color = dataType)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Previous Distribution") + theme_classic() +
scale_color_manual(values=c("#e08926", "#3526e0"))
df_disc <- raw_train_data[, ..discrete_var]
df_disc <- sapply(df_disc, as.factor)
df_disc <- as.data.frame(melt(df_disc))
disc_dist <- ggplot(df_disc, aes(value)) +
geom_bar(aes(fill = Var2)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
scale_x_discrete(expand = c(0,0)) +
facet_wrap(~Var2, scales = "free", nrow = 2) +
scale_fill_tableau() +
ggtitle("Count of each discrete variable") +
labs(fill = "", x = "", y = "") +
theme_minimal() +
theme(text = element_text(face = "plain"),
legend.position = "none",
axis.text.x = element_text(size = 7, angle = 90),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
disc_dist
job_train <- ggplot(raw_train_data, aes(x=job)) +
geom_bar(fill ="#3526e0") + ggtitle("Job Distribution - Train") + labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"), legend.position = "none",
axis.text.x = element_text(size = 7, angle = 90), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
job_test <- ggplot(raw_test_data, aes(x=job)) +
geom_bar(fill = "#e08926") + ggtitle("Job Distribution - Test") + labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"), legend.position = "none", axis.text.x = element_text(size = 7, angle = 90), panel.grid.major=element_blank(),
panel.grid.minor = element_blank(),plot.title = element_text(hjust = 0.5))
grid.arrange(job_train, job_test, ncol=2)
marital_train <- ggplot(raw_train_data, aes(x=marital)) +
geom_bar(fill ="#3526e0") + ggtitle("Marital Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
marital_test <- ggplot(raw_test_data, aes(x=marital)) +
geom_bar(fill = "#e08926") + ggtitle("Marital Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(marital_train, marital_test, ncol=2)
education_train <- ggplot(raw_train_data, aes(x=education)) +
geom_bar(fill ="#3526e0") + ggtitle("Education Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
education_test <- ggplot(raw_test_data, aes(x=education)) +
geom_bar(fill = "#e08926") + ggtitle("Education Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(education_train, education_test, ncol=2)
Credit in default?
default_train <- ggplot(raw_train_data, aes(x=default)) +
geom_bar(fill ="#3526e0") + ggtitle("Default Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
default_test <- ggplot(raw_test_data, aes(x=default)) +
geom_bar(fill = "#e08926") + ggtitle("Default Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(default_train, default_test, ncol=2)
Has housing loan?
housing_train <- ggplot(raw_train_data, aes(x=housing)) +
geom_bar(fill ="#3526e0") + ggtitle("Housing Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
housing_test <- ggplot(raw_test_data, aes(x=housing)) +
geom_bar(fill = "#e08926") + ggtitle("Housing Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(housing_train, housing_test, ncol=2)
Has personal loan?
loan_train <- ggplot(raw_train_data, aes(x=loan)) +
geom_bar(fill ="#3526e0") + ggtitle("Loan Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
loan_test <- ggplot(raw_test_data, aes(x=loan)) +
geom_bar(fill = "#e08926") + ggtitle("Loan Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(loan_train, loan_test, ncol=2)
Communication type
contact_train <- ggplot(raw_train_data, aes(x=contact)) +
geom_bar(fill ="#3526e0") + ggtitle("Contact Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
contact_test <- ggplot(raw_test_data, aes(x=contact)) +
geom_bar(fill = "#e08926") + ggtitle("Contact Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(contact_train, contact_test, ncol=2)
month_train <- ggplot(raw_train_data, aes(x=month)) +
geom_bar(fill ="#3526e0") + ggtitle("Month Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
month_test <- ggplot(raw_test_data, aes(x=month)) +
geom_bar(fill = "#e08926") + ggtitle("Month Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(month_train, month_test, ncol=2)
Outcome previous campaign offer
poutcome_train <- ggplot(raw_train_data, aes(x=poutcome)) +
geom_bar(fill ="#3526e0") + ggtitle("poutcome Distribution - Train") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
poutcome_test <- ggplot(raw_test_data, aes(x=poutcome)) +
geom_bar(fill = "#e08926") + ggtitle("poutcome Distribution - Test") +
labs(y = "", x="") + theme_minimal() + theme(text = element_text(face = "plain"),
legend.position = "none", panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))
grid.arrange(poutcome_train, poutcome_test, ncol=2)
When we create machine learning models, an important factor to take into account is correlation among features. Using highly correlated features in a machine learning model can lead to the problem of multicollinearity, resulting in an unstable model. Thus we check the correlation among our numeric features. Overall, we see that only pdays and previous seem to be correlated. This makes sense, since they reflect the number of days that have passed since the client was last contacted and the actual number of previous contacts for this client. Later in the process we will see whether dropping one of the features, or creating a combination of the two, helps to improve our model.
correlation <- cor(cont_var)
p <- plot_ly(
x = continuous_var,
y=continuous_var,
z = correlation,
type = "heatmap",
colorscale=list(c(0, "rgb(0,0,255)"),
list(0.1, "rgb(51,153,255)"),
list(0.2, "rgb(102,204,255)"),
list(0.3, "rgb(153,204,255)"),
list(0.4, "rgb(204,204,255)"),
list(0.5, "rgb(255,255,255)"),
list(0.6, "rgb(255,204,255)"),
list(0.7, "rgb(255,153,255)"),
list(0.8, "rgb(255,102,204)"),
list(0.9, "rgb(255,102,102)"),
list(1, "rgb(255,0,0)")),
title = "Correlation Matrix")
p <- layout(p, title="Continuous Variable Correlations")
p
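Beyond eyeballing the heatmap, correlated pairs can also be flagged programmatically. A minimal sketch (using a toy 3x3 matrix in place of the `correlation` matrix computed above; the threshold 0.4 is an arbitrary illustration, not a recommendation):

```r
# Sketch: list feature pairs whose absolute correlation exceeds a threshold.
# A toy symmetric matrix stands in for cor(cont_var) here.
m <- matrix(c(1.00, 0.45, 0.02,
              0.45, 1.00, 0.01,
              0.02, 0.01, 1.00), nrow = 3,
            dimnames = rep(list(c("pdays", "previous", "age")), 2))
# upper.tri() avoids reporting each pair twice (and the diagonal)
idx <- which(abs(m) > 0.4 & upper.tri(m), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(m)[idx[, 1]],
                         var2 = colnames(m)[idx[, 2]])
```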
We already saw in the distributions that we have skewed and unbalanced features. Thus we need to take a look at outliers. Outliers might actually be rows that contain noise rather than actual information. If that is the case, these rows should be removed.
For the continuous variables we see that, in particular for balance and duration, there are several outliers. Since we will not use duration as a feature in the end, that is not a problem there. For balance such outliers make sense, since there are people with significantly higher balances than others.
cont_box <- ggplot(cont_var_melt, aes(variable, value)) +
geom_boxplot(aes(fill = variable)) +
coord_flip() +
scale_fill_tableau() +
labs(x = "", y = "") +
theme_minimal() +
theme(text = element_text(face = "bold"),
legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank())
cont_box
df_cont_norm <- raw_train_data[,..continuous_var]
df_cont_norm <- as.data.frame(apply(df_cont_norm, 2,function(x)((x - min(x))/(max(x)-min(x)))))
df_cont_norm <- as.data.frame(melt(df_cont_norm))
cont_box_norm <- ggplot(df_cont_norm, aes(variable, value)) +
geom_boxplot(aes(fill = variable)) +
coord_flip() +
scale_fill_tableau() +
labs(x = "", y = "") +
theme_minimal() +
theme(text = element_text(face = "bold"),
legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank())
cont_box_norm
For the discrete features we see that several of them are imbalanced. However, these imbalances might actually be helpful, since there might be a relationship to the actual campaign outcome. We will look at this in the next step.
disc_box <- ggplot(df_disc, aes(Var2, as.numeric(value))) +
geom_boxplot(aes(fill = Var2)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_x_discrete(expand = c(0,0)) +
facet_wrap(~Var2, scales = "free", ncol = 1) +
scale_fill_tableau() +
ggtitle("Distribution of each discrete variable") +
labs(fill = "", x = "", y = "") +
coord_flip() +
theme_light() +
theme(text = element_text(face = "bold"),
legend.position = "none",
axis.text.x = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
strip.background = element_blank(),
strip.text.x = element_blank())
disc_box
From this graph, we can see that as age increases, the density of people accepting or rejecting the campaign decreases. In addition, above roughly 60 years the density for accepting lies above the density for rejecting, meaning that older people are more likely to accept the offer, whereas people between 30 and 60 tend to reject it (higher density for “no” than for “yes”).
ggplot(raw_train_data, aes(x=age, color = y)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Age Target Distribution") + theme_classic() +
labs(color = "Offer accepted") +
scale_color_manual(values=c("#995052", "#529950"))
According to this graph, the distributions for accepted and rejected offers have a similar shape and peak at roughly the same (low) balance, but the peak for “no” is higher, while the “yes” curve carries relatively more of its mass at higher balances. This suggests that people with higher balances are somewhat more likely to accept the offer.
ggplot(raw_train_data, aes(x=balance, color = y)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Balance Target Distribution") + theme_classic() +
labs(color = "Offer accepted") +
scale_color_manual(values=c("#995052", "#529950"))
We can see that at the very beginning of the month, people are more likely to accept the offer than to reject it (the “yes” density lies above the “no” density), plausibly because they have just received their salary and have money available to deposit. Towards the end of the month, in contrast, the “no” density lies above the “yes” density, so people are less likely to accept the offer.
ggplot(raw_train_data, aes(x=day, color = y)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Day Target Distribution") + theme_classic() +
labs(color = "Offer accepted") +
scale_color_manual(values=c("#995052", "#529950"))
From this graph, we can conclude that the shorter the call, the higher the likelihood of rejecting the offer; the longer the call, the higher the likelihood of acceptance. This attribute strongly determines the output target: if duration = 0, then y is necessarily “no”. However, duration is not known at the time of prediction, so using it would amount to target leakage. That is why we will discard it later on, in order to have a realistic predictive model.
ggplot(raw_train_data, aes(x=duration, color = y)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Duration Target Distribution") + theme_classic() +
labs(color = "Offer accepted") +
scale_color_manual(values=c("#995052", "#529950"))
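Discarding the leaky feature is a one-liner; a minimal sketch on a toy data frame (in the project this would be applied to `raw_train_data` at feature engineering time):

```r
# Sketch: exclude the leaky `duration` feature before modelling.
# Toy data frame used for illustration.
df <- data.frame(duration = c(11, 176, 207),
                 age      = c(50, 47, 56),
                 campaign = c(15, 2, 6))
df_model <- df[, setdiff(names(df), "duration"), drop = FALSE]
names(df_model)   # "age" "campaign"
```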
We can see a difference between the distributions of people who accepted and people who rejected the offer with respect to campaign (the number of contacts performed during this campaign). Both curves weakly suggest that a higher number of contacts goes along with a slightly higher likelihood of accepting the offer, but this is not consistently the case.
ggplot(raw_train_data, aes(x=campaign, color = y)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Campaign Target Distribution") + theme_classic() +
labs(color = "Offer accepted") +
scale_color_manual(values=c("#995052", "#529950"))
As the number of days that have passed since the client was last contacted in a previous campaign increases, the likelihood of accepting the offer decreases. This makes sense, as clients tend to forget about the campaign if they are not regularly reminded of it.
ggplot(raw_train_data, aes(x=pdays, color = y)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("pdays Target Distribution") + theme_classic() +
labs(color = "Offer accepted") +
scale_color_manual(values=c("#995052", "#529950"))
At the very beginning of the x-axis we can see that as the number of contacts performed before this campaign (previous) increases, the likelihood of people accepting the offer will be higher than the likelihood of people rejecting the offer.
ggplot(raw_train_data, aes(x=previous, color = y)) +
geom_density(alpha = 0.7, size=1.5) +
ggtitle("Previous Target Distribution") + theme_classic() +
labs(color = "Offer accepted") +
scale_color_manual(values=c("#995052", "#529950"))
According to this graph, the ratio of accepting to rejecting the offer differs by job category. Students are the most likely to accept, with a ratio of approximately 1:3, whereas blue-collar workers have the lowest acceptance ratio, at about 1:13.
ggplot(raw_train_data,mapping = aes(job,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
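Ratios like these can be read off a simple cross-tabulation. A sketch on an invented mini-sample (not the real counts):

```r
# Rejections per acceptance for each job category, on toy data
job <- c("student", "student", "student", "blue-collar",
         "blue-collar", "blue-collar", "blue-collar", "blue-collar")
y   <- c("yes", "yes", "no", "no", "no", "no", "no", "yes")
tab <- table(job, y)                     # counts of yes/no per category
no_per_yes <- tab[, "no"] / tab[, "yes"] # e.g. 4 rejections per acceptance
```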
From this graph, we can see that divorced customers have a higher acceptance/rejection ratio than the other marital categories (married, single), at about 1:8, whereas married customers have the lowest ratio, at about 1:9, meaning they are more likely to reject the offer. This makes sense, as married people are financially responsible not only for themselves but also for others, such as children or a partner who does not work.
ggplot(raw_train_data,mapping = aes(marital,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
The highest acceptance/rejection ratio is found in the unknown education category, at about 1:6. The lowest ratio, about 1:9, belongs to the secondary category.
ggplot(raw_train_data,mapping = aes(education,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
From this plot, we can see that customers who defaulted on a loan are less likely to accept the offer, with an acceptance/rejection ratio of roughly 1:17, whereas those who did not default accept more often, at a ratio of roughly 1:7.
ggplot(raw_train_data,mapping = aes(default,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
Customers with a housing loan are less likely to accept the offer (an acceptance/rejection ratio of about 1:13), whereas customers without a housing loan accept considerably more often, with a ratio of about 1:5.
ggplot(raw_train_data,mapping = aes(housing,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
According to the graph, customers without a personal loan accept the offer at a ratio of about 1:7, whereas those with a personal loan accept at only about 1:15.
ggplot(raw_train_data,mapping = aes(loan,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
The ones that were contacted by telephone (1:5) are more likely to accept the offer than the ones that were contacted by either cellular (1:6) or unknown type (1:27).
ggplot(raw_train_data,mapping = aes(contact,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
From this graph, we can see that the highest acceptance/rejection ratios occur in December and September, with ratios of roughly 1:1.5 and almost 1:1 respectively. The lowest ratio, about 1:14, occurs in May.
ggplot(raw_train_data,mapping = aes(month,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
When the outcome of the previous campaign was successful, the offer is much more likely to be accepted, with a ratio of about 2:1. When the previous outcome is unknown, however, the ratio drops to 1:10, meaning rejection is far more likely than acceptance.
ggplot(raw_train_data,mapping = aes(poutcome,fill=y))+
geom_bar(col="black", position = position_dodge(1))+
geom_text(stat="count",aes(label=scales::percent((..count..)/sum(..count..))),
position=position_dodge(1), size=3, vjust = -.5)+
theme_classic() +
scale_fill_manual("Offer Accepted", values = c("no" = "#995052", "yes" = "#529950"))
First of all, we saw that there are no missing values in our data set, which matters for data cleaning. Furthermore, our blind test data seems to be a random sample from the complete data set, since the feature distributions of train and test are similar. For our later models we will try scaling some variables due to the high skewness, and also create features from numeric combinations, since some of these variables are not only correlated but also related from a business perspective. Looking at the target distribution across variables, we saw that the duration of the phone call differs markedly between successful and unsuccessful offers, hinting that it would be an important feature. However, since duration cannot be observed BEFORE a call is actually made, it cannot be included in the model. Nevertheless, the target distributions revealed further insights, such as a very high rejection rate during May.
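For the scaling of skewed variables mentioned here, one common option is a log transform; a minimal sketch on a toy non-negative vector (balance can be negative, so it would need a shift or a signed transform first):

```r
# log1p compresses right-skewed, non-negative variables:
# four orders of magnitude collapse into roughly the 0-9.2 range
x     <- c(0, 1, 10, 100, 1000, 10000)
x_log <- log1p(x)        # log(1 + x), safe at zero
```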
Now that data exploration is done and no data cleaning needs to be performed at this stage, we can move on to our predictive model, which can be found in the other markdown.